OLERA: On-Line Extraction Rule Analysis for Semi-structured Documents
Abstract
The vast amount of information available online has led to renewed interest in information extraction (IE) systems, which analyze input documents to produce a structured representation of selected information. Information extraction from semi-structured documents has been studied extensively in recent years. Most research focuses on supervised learning approaches, in which targets must be labelled in the training set. Information extraction with an unlabelled training set is hard and works only for multi-record documents. This paper introduces OLERA (On-Line Extraction Rule Analysis), a framework for the rapid generation of IE systems that can extract structured data from semi-structured Web documents with an unlabelled training set. In this novel framework, extraction rules can be trained not only from a multiple-record Web page but also from multiple single-record Web pages (called singular pages). Evaluation results show a high level of extraction performance for both singular pages and multi-record pages.
Similar resources
OLERA: A Semi-supervised Approach for Web Data Extraction with Visual Support
Information extraction (IE) from semi-structured Web documents plays an important role for a variety of information agents. Over the past few years, researchers have developed a rich family of generic IE techniques based on supervised approaches which learn extraction rules from user-labelled training examples. However, annotating training data can be expensive when thousands of data sources ne...
Rule-Based Information Extraction for Structured Data Acquisition using TextMarker
Information extraction is concerned with the location of specific items in (unstructured) textual documents, e.g., being applied for the acquisition of structured data. Then, the acquired data can be applied for mining methods requiring structured input data, in contrast to other text mining methods that utilize a bag-of-words approach. This paper presents a semi-automatic approach for structur...
Hierarchical Concept Description and Learning for Information Extraction
This paper addresses the problem of extracting information from textual documents, either normal documents or web pages. A new approach for extracting complicate information from semi-structured documents is introduced that exploits a successive hierarchical rule-learning algorithm. Through evaluation it is shown that this approach can extract complicate concepts with a much higher precision th...
Space characters in Chinese semi-structured texts
Space characters can have an important role in disambiguating text. However, few, if any, Chinese information extraction systems make full use of space characters. However, it seems that treatment of space characters is necessary, especially in cases of extracting information from semi-structured documents. This investigation aims to address the importance of space characters in Chinese informa...
Diploma Thesis Analysis and Comparison of Existent Information Extraction Methods
Information extraction is initially applied for identification of desired information from natural language documents and conversion of the extracted text into a self-defined presentation. With the rapidly increasing amount of available information sources and electronic documents on the World Wide Web, information extraction is extended for identification from structured and semi-structured we...